Communicating Research Findings

PSCI 2270 - Week 13

Georgiy Syunyaev

Department of Political Science, Vanderbilt University

November 19, 2024

Plan for this week



  1. Communicating research findings

  2. Let’s make our workhorses with R

  3. Presentations

  4. Q&A on projects or class

Communicating research findings

The issue with communication


  • There is a lot of information in the world

    • There is even a lot of information in your individual projects… TOO MUCH
  • You have two tasks

    1. Understand what results you have: Descriptive, observational, experimental
    2. Decide how to present your results: Simplify and tell a story with your data
  • In the end you will have a result from the data AND will want others to understand and believe it

How we usually communicate

Two main approaches: Tables and Graphs

  • Tables can be more informative for an advanced audience (one that understands point estimates and standard errors)
  • Graphs can be more informative for a general audience
  • Sometimes it is not enough to just look at the table! \(\Rightarrow\) Always plot your data!

Anscombe’s Quartet


  • F.J. Anscombe (1973): “…make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding.”
as_tibble(anscombe)
# A tibble: 11 × 8
      x1    x2    x3    x4    y1    y2    y3    y4
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1    10    10    10     8  8.04  9.14  7.46  6.58
 2     8     8     8     8  6.95  8.14  6.77  5.76
 3    13    13    13     8  7.58  8.74 12.7   7.71
 4     9     9     9     8  8.81  8.77  7.11  8.84
 5    11    11    11     8  8.33  9.26  7.81  8.47
 6    14    14    14     8  9.96  8.1   8.84  7.04
 7     6     6     6     8  7.24  6.13  6.08  5.25
 8     4     4     4    19  4.26  3.1   5.39 12.5 
 9    12    12    12     8 10.8   9.13  8.15  5.56
10     7     7     7     8  4.82  7.26  6.42  7.91
11     5     5     5     8  5.68  4.74  5.73  6.89

Anscombe’s Quartet: Estimates 😎

  • There are four studies. Let’s look at the statistical relationship between \(X\) and \(Y\) for each of them
lm(y1 ~ x1, data = anscombe)

Call:
lm(formula = y1 ~ x1, data = anscombe)

Coefficients:
(Intercept)           x1  
     3.0001       0.5001  
lm(y3 ~ x3, data = anscombe)

Call:
lm(formula = y3 ~ x3, data = anscombe)

Coefficients:
(Intercept)           x3  
     3.0025       0.4997  
lm(y2 ~ x2, data = anscombe)

Call:
lm(formula = y2 ~ x2, data = anscombe)

Coefficients:
(Intercept)           x2  
      3.001        0.500  
lm(y4 ~ x4, data = anscombe)

Call:
lm(formula = y4 ~ x4, data = anscombe)

Coefficients:
(Intercept)           x4  
     3.0017       0.4999  
  • Note: The fitted slopes are not correlations. We can compute the linear correlation with the cor() function; it equals the regression slope only when both variables are standardized (e.g. with scale())
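
A minimal sketch of the correlation check for the four studies. It uses only base R and the built-in anscombe data; the object names correlations, z and slope_std are made up for this illustration.

```r
# Correlation between x and y in each of the four Anscombe studies
# (anscombe ships with base R's datasets package)
correlations <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
round(correlations, 3)  # all four are roughly 0.82

# Standardize x1 and y1 by hand, then fit the same regression as before
z <- data.frame(
  x = as.numeric(scale(anscombe$x1)),
  y = as.numeric(scale(anscombe$y1))
)
slope_std <- unname(coef(lm(y ~ x, data = z))["x"])

# With standardized variables the slope matches the correlation
all.equal(slope_std, correlations[1])
```

So the four studies share not only slopes and intercepts but also nearly identical correlations, which is exactly why the numbers alone cannot tell them apart.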

Anscombe’s Quartet: Long format 👌

  • Long format is useful for ggplot2: Stack \(X\)’s and \(Y\)’s for each study on top of each other

    • In the tidyverse this long format is also called tidy (which is where the name tidyverse comes from)
as_tibble(anscombe_tidy)
# A tibble: 44 × 4
   study    id     x     y
   <chr> <int> <dbl> <dbl>
 1 1         1    10  8.04
 2 2         1    10  9.14
 3 3         1    10  7.46
 4 4         1     8  6.58
 5 1         2     8  6.95
 6 2         2     8  8.14
 7 3         2     8  6.77
 8 4         2     8  5.76
 9 1         3    13  7.58
10 2         3    13  8.74
# ℹ 34 more rows
lm(y ~ x, data = anscombe_tidy)

Call:
lm(formula = y ~ x, data = anscombe_tidy)

Coefficients:
(Intercept)            x  
     3.0013       0.4999  
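
The slides do not show how anscombe_tidy was created. One possible way to build a data frame with the same columns (study, id, x, y) is sketched below, assuming dplyr and tidyr are installed; the actual object used in class may have been built differently.

```r
library(dplyr)
library(tidyr)

anscombe_tidy <- anscombe |>
  # keep track of the original row before reshaping
  mutate(id = row_number()) |>
  # split names like "x1" into the variable ("x") and the study ("1")
  pivot_longer(
    cols = -id,
    names_to = c(".value", "study"),
    names_pattern = "(.)(.)"
  ) |>
  # put study first, as in the printed tibble
  relocate(study)

# With long data, per-study estimates are one grouped summary away
anscombe_tidy |>
  group_by(study) |>
  summarize(
    slope = coef(lm(y ~ x))[2],
    correlation = cor(x, y)
  )
```

Note that the pooled regression on the slide treats all 44 rows as one study, while the grouped summary recovers the four separate fits.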

Anscombe’s Quartet: Pooled plot 🙆‍♂️

ggplot(data = anscombe_tidy, 
       mapping = aes(x = x, y = y)) +
  geom_smooth(method = lm, se = FALSE, color = "grey") +
  geom_point() + 
  coord_equal() +
  theme_bw()

Anscombe’s Quartet: Split plot 🤯

ggplot(data = anscombe_tidy, 
       mapping = aes(x = x, y = y, color = study)) +
  geom_smooth(method = lm, se = FALSE, color = "grey") +
  geom_point() +
  coord_equal() +
  facet_wrap(~ study) +
  theme_bw()

Anscombe 2.0 Cairo; Matejka & Fitzmaurice


Corruption and Human Development



  • What general patterns can we get from this plot?

Rigid middle class



  • What general patterns can we get from this plot?

Differences in aspirations in 2021


  • What general patterns can we get from this plot?

King (2006)



All tables and figures should be separately and fully documented. Someone reading only them, without the paper, should be able to understand what is going on. Adding an explanatory paragraph at the bottom of each figure or table is usually necessary to accomplish this. Similarly, someone who reads the paper and ignores the table or figure should also be able to follow it all. The point of the text is to walk the reader by the hand through the table or figure so it is easy to understand. Picking out one number in the table and explaining it in detail at the outset as an example is often a good strategy.

Graphing types


  1. Position along a common scale: Spatial location along a common baseline to represent data

    • Example: Bar chart where the length of each bar represents a certain value, and each bar starts from the same baseline.
  2. Position along non-aligned scales: Position is still used, but the baselines differ

    • Example: Grouped bar chart where each group has a different baseline
  3. Length, direction, angle: Comparisons based on length are fairly accurate. Direction is slightly less so, and angle even less.

    • Example: A stacked bar chart (length) or a pie chart (angle)

Graphing types


  4. Area: Area is perceived less accurately than position, length, direction, or angle

    • Example: Circle size in a bubble chart
  5. Volume and curvature: These are difficult for us to evaluate accurately

    • Example: 3D charts that use volume to represent data
  6. Shading and colour saturation: These are the least accurately perceived

    • Example: Heatmap

Hierarchy by Munzner (2014)


  • How would you test this?

Experimental evidence (!!)

  • Cleveland and McGill (1984): Foundational experiments
  • Heer and Bostock (2010): Using Amazon MTurk crowd-sourcing sample
  • Davis et al. (2022): Account for differences in respondent characteristics
  • Position > angle ? area ? volume \(\Rightarrow\) Position rules!
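
To get a feel for why position is easier to read than angle, the same numbers can be drawn both ways. This is only an illustrative sketch: the shares data frame is hypothetical, with values chosen to be close to each other.

```r
library(ggplot2)

# Hypothetical shares that differ only slightly
shares <- data.frame(
  group = c("A", "B", "C", "D", "E"),
  share = c(0.18, 0.20, 0.22, 0.19, 0.21)
)

# Position along a common scale: all bars start from the same baseline
p_position <- ggplot(shares, aes(x = group, y = share)) +
  geom_col() +
  labs(title = "Position encoding", x = NULL, y = "Share") +
  theme_minimal()

# Angle: the same values drawn as pie slices
p_angle <- ggplot(shares, aes(x = "", y = share, fill = group)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Angle encoding", x = NULL, y = NULL) +
  theme_void()

p_position
p_angle
```

Try ranking the five groups from each plot: the ordering is usually much easier to read off the bars than off the slices.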

Common issues with graphs

  • Not enough information (e.g. improper or missing labels)
  • Hard to compare data (e.g. using 3D shapes)
  • Not being transparent (e.g. about scales)
  • Cutting data (e.g. using binary indicators instead of averages)
  • Too cluttered/Too much data
  • More examples here

Which plots we use

  • Workhorses

    • Histograms

    • Scatterplots

    • Time trends

    • Dot-whisker / Box plots

  • …Ponies
  • …Unicorns
gapminder |> 
  ggplot2::ggplot(
    mapping = aes(x = gdpPercap)) +
  ggplot2::geom_histogram(bins = 30, fill = "lightgrey", color = "black") +
  ggplot2::labs(title = "Histogram of GDP Per Capita") +
  ggplot2::theme_minimal()

gapminder |> 
  ggplot2::ggplot(
    mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
  ggplot2::geom_point(alpha = 0.2) +
  ggplot2::stat_smooth(se = FALSE) +
  ggplot2::scale_x_log10() +
  ggplot2::labs(
    title = "Life Expectancy vs GDP Per Capita by Continent",
    x = "log(gdpPercap)") +
  ggplot2::theme_minimal()

gapminder |> 
  dplyr::group_by(continent, year) |> 
  dplyr::summarise(lifeExp = mean(lifeExp, na.rm = TRUE)) |> 
  dplyr::mutate(type = "Continent average") |> 
  dplyr::bind_rows(gapminder) |> 
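  dplyr::mutate(type = ifelse(is.na(type), "Country", type),
                # convert factors to character first: ifelse() on factors returns integer level codes
                country = ifelse(is.na(country), as.character(continent), as.character(country))) |> 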
  ggplot2::ggplot(
    mapping = aes(x = year, y = lifeExp, 
                  group = country, color = type, alpha = type)) +
  ggplot2::geom_line(linewidth = 0.8, show.legend = FALSE) +
  ggplot2::facet_wrap(~continent, ncol = 2) +
  ggplot2::scale_alpha_manual(values = c(1, .1)) +
  ggplot2::labs(
    title = "Trends in Life Expectancy by Continent") + 
  ggplot2::theme_minimal() 

gapminder |> 
  ggplot2::ggplot(
    mapping = aes(x = continent, y = lifeExp)) +
  ggplot2::geom_boxplot(outlier.colour = "hotpink") +
  ggplot2::geom_jitter(position = position_jitter(width = 0.1, height = 0), 
                       alpha = 0.25) +
  ggplot2::labs(title = "Box Plot of Life Expectancy by Continent") +
  ggplot2::theme_minimal()

Let’s make our workhorses with R

Recap on tidyverse



  • Remember our friend tidyverse package?

    • It works best with so-called tidy datasets
    • It provides a full suite of functions that allow us to load (readr and haven), manipulate (dplyr and tidyr) and plot (ggplot2) our data
    • It presumes we are using pipelines: %>% or |> feeds the left-hand side expression into the right-hand side expression as its first argument
  • We will skip the loading of data today and just learn how to quickly feed tidy data to ggplot2
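
A small illustration of what the pipe does, using only base R (the |> pipe needs R 4.1 or newer). The names nested and piped are made up for this sketch.

```r
# Nested calls are read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# The same computation as a pipeline is read from left to right:
# each result becomes the first argument of the next function
piped <- c(1, 4, 9) |>
  sqrt() |>
  sum()

identical(nested, piped)
```

Both versions compute the same thing; the pipeline just lists the steps in the order they happen, which is why it pairs well with the step-by-step data preparation that follows.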

tidy data



tidy long data



Install and load packages


  • To start off we need to install and load our packages
# this will install packages (you only do this once)

install.packages("tidyverse") # this package loads all other packages we need
install.packages("socviz") # we will use data from this package
  • Load the packages you installed
# this will LOAD packages (you do this every time you work with it)

library("tidyverse")
library("socviz") # provides the gss_sm dataset
  • Load the data we need (gss_sm)
# this will load the dataset gss_sm from socviz package

data("gss_sm")
  • We can look at the help file for the dataset using ?
# this will open help page for the dataset in RStudio

?gss_sm

Let’s look at the data

  • To look at the data you just need to type the name of the dataset
gss_sm
# A tibble: 2,867 × 32
    year    id ballot       age childs sibs   degree race  sex   region income16
   <dbl> <dbl> <labelled> <dbl>  <dbl> <labe> <fct>  <fct> <fct> <fct>  <fct>   
 1  2016     1 1             47      3 2      Bache… White Male  New E… $170000…
 2  2016     2 2             61      0 3      High … White Male  New E… $50000 …
 3  2016     3 3             72      2 3      Bache… White Male  New E… $75000 …
 4  2016     4 1             43      4 3      High … White Fema… New E… $170000…
 5  2016     5 3             55      2 2      Gradu… White Fema… New E… $170000…
 6  2016     6 2             53      2 2      Junio… White Fema… New E… $60000 …
 7  2016     7 1             50      2 2      High … White Male  New E… $170000…
 8  2016     8 3             23      3 6      High … Other Fema… Middl… $30000 …
 9  2016     9 1             45      3 5      High … Black Male  Middl… $60000 …
10  2016    10 3             71      4 1      Junio… White Male  Middl… $60000 …
# ℹ 2,857 more rows
# ℹ 21 more variables: relig <fct>, marital <fct>, padeg <fct>, madeg <fct>,
#   partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
#   zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
#   agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
#   bigregion <fct>, partners_rc <fct>, obama <dbl>
  • We can also print all the names of the variables using names()
names(gss_sm)
 [1] "year"        "id"          "ballot"      "age"         "childs"     
 [6] "sibs"        "degree"      "race"        "sex"         "region"     
[11] "income16"    "relig"       "marital"     "padeg"       "madeg"      
[16] "partyid"     "polviews"    "happy"       "partners"    "grass"      
[21] "zodiac"      "pres12"      "wtssall"     "income_rc"   "agegrp"     
[26] "ageq"        "siblings"    "kids"        "religion"    "bigregion"  
[31] "partners_rc" "obama"      

Let’s look at the data


  • We can also look at some observations of a specific variable using $ and []
gss_sm$bigregion[1:10]
 [1] Northeast Northeast Northeast Northeast Northeast Northeast Northeast
 [8] Northeast Northeast Northeast
Levels: Northeast Midwest South West
gss_sm$obama[1:10]
 [1]  0  1  0  0  1  1 NA NA NA  0
  • Or even produce some summaries of our variables using table() or summary()
table(gss_sm$grass, useNA = "ifany")

    Legal Not Legal      <NA> 
     1126       717      1024 
summary(gss_sm$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  18.00   34.00   49.00   49.16   62.00   89.00      10 

Preparing data: Selecting


gss_sm |>
  select(bigregion, race, age, obama)
# A tibble: 2,867 × 4
   bigregion race    age obama
   <fct>     <fct> <dbl> <dbl>
 1 Northeast White    47     0
 2 Northeast White    61     1
 3 Northeast White    72     0
 4 Northeast White    43     0
 5 Northeast White    55     1
 6 Northeast White    53     1
 7 Northeast White    50    NA
 8 Northeast Other    23    NA
 9 Northeast Black    45    NA
10 Northeast White    71     0
# ℹ 2,857 more rows

Preparing data: Filtering


gss_sm |>
  select(bigregion, race, age, obama) |>
  drop_na() # this is a special case of filter()
# A tibble: 1,728 × 4
   bigregion race    age obama
   <fct>     <fct> <dbl> <dbl>
 1 Northeast White    47     0
 2 Northeast White    61     1
 3 Northeast White    72     0
 4 Northeast White    43     0
 5 Northeast White    55     1
 6 Northeast White    53     1
 7 Northeast White    71     0
 8 Northeast Black    32     1
 9 Northeast Black    60     1
10 Northeast White    76     0
# ℹ 1,718 more rows

Preparing data: Grouping


gss_sm |>
  select(bigregion, race, age, obama) |>
  drop_na() |> 
  group_by(bigregion, race)
# A tibble: 1,728 × 4
# Groups:   bigregion, race [12]
   bigregion race    age obama
   <fct>     <fct> <dbl> <dbl>
 1 Northeast White    47     0
 2 Northeast White    61     1
 3 Northeast White    72     0
 4 Northeast White    43     0
 5 Northeast White    55     1
 6 Northeast White    53     1
 7 Northeast White    71     0
 8 Northeast Black    32     1
 9 Northeast Black    60     1
10 Northeast White    76     0
# ℹ 1,718 more rows

Preparing data: Summarizing


gss_sm |>
  select(bigregion, race, age, obama) |>
  drop_na() |>
  group_by(bigregion, race) |>
  summarize(
    total = n(),
    obama_share = mean(obama),
    obama_se = sd(obama)/sqrt(n())
  )
# A tibble: 12 × 5
# Groups:   bigregion [4]
   bigregion race  total obama_share obama_se
   <fct>     <fct> <int>       <dbl>    <dbl>
 1 Northeast White   259       0.645   0.0298
 2 Northeast Black    39       0.974   0.0256
 3 Northeast Other    15       0.867   0.0909
 4 Midwest   White   360       0.586   0.0260
 5 Midwest   Black    62       1       0     
 6 Midwest   Other    20       0.85    0.0819
 7 South     White   404       0.413   0.0245
 8 South     Black   179       0.944   0.0172
 9 South     Other    26       0.731   0.0887
10 West      White   280       0.539   0.0298
11 West      Black    34       1       0     
12 West      Other    50       0.64    0.0686
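
To see what the obama_se column measures, the calculation can be repeated by hand for one group. The vector below is hypothetical: 38 ones and a single zero, which is consistent with total = 39 and obama_share = 0.974 in the Northeast / Black row.

```r
# 38 of 39 respondents reporting a vote for Obama (hypothetical vector
# consistent with the Northeast / Black row of the table)
obama <- c(rep(1, 38), 0)

n     <- length(obama)
share <- mean(obama)            # sample share
se    <- sd(obama) / sqrt(n)    # standard error of the mean

round(c(share = share, se = se), 4)

# If everyone in a group gives the same answer, sd() is 0 and so is the SE
se_all_same <- sd(rep(1, 62)) / sqrt(62)
se_all_same
```

The second calculation explains the zeros in the Midwest / Black and West / Black rows: an estimated standard error of 0 reflects no variation in the sample, not certainty about the population.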

Preparing data: Mutating


gss_sm |>
  select(bigregion, race, age, obama) |>
  drop_na() |>
  group_by(bigregion, race) |>
  summarize(
    total = n(),
    obama_share = mean(obama),
    obama_se = sd(obama)/sqrt(n())
  ) |>
  mutate(pct = total/sum(total))
# A tibble: 12 × 6
# Groups:   bigregion [4]
   bigregion race  total obama_share obama_se    pct
   <fct>     <fct> <int>       <dbl>    <dbl>  <dbl>
 1 Northeast White   259       0.645   0.0298 0.827 
 2 Northeast Black    39       0.974   0.0256 0.125 
 3 Northeast Other    15       0.867   0.0909 0.0479
 4 Midwest   White   360       0.586   0.0260 0.814 
 5 Midwest   Black    62       1       0      0.140 
 6 Midwest   Other    20       0.85    0.0819 0.0452
 7 South     White   404       0.413   0.0245 0.663 
 8 South     Black   179       0.944   0.0172 0.294 
 9 South     Other    26       0.731   0.0887 0.0427
10 West      White   280       0.539   0.0298 0.769 
11 West      Black    34       1       0      0.0934
12 West      Other    50       0.64    0.0686 0.137 
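
Why does pct add up to 1 within each region rather than over the whole table? By default summarize() drops only the last grouping variable (race), so the result is still grouped by bigregion, as the "# Groups: bigregion [4]" line shows. A sketch with the totals copied from the Northeast and Midwest rows, assuming dplyr is installed; the object names are made up for this illustration.

```r
library(dplyr)

# Totals copied from the Northeast and Midwest rows of the summary table
counts <- tibble(
  bigregion = rep(c("Northeast", "Midwest"), each = 3),
  race      = rep(c("White", "Black", "Other"), times = 2),
  total     = c(259, 39, 15, 360, 62, 20)
)

# Grouped by region: shares sum to 1 within each region
within_region <- counts |>
  group_by(bigregion) |>
  mutate(pct = total / sum(total)) |>
  ungroup()

# No grouping: shares sum to 1 across all rows
overall <- counts |>
  mutate(pct = total / sum(total))

within_region
overall
```

If shares of the whole sample are what you want, add ungroup() before mutate() in the pipeline.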

ggplot2 thinking


https://visualizingsociety.com/

ggplot(data = [DATA], 
       mapping = aes([MAPPINGS])) +                # required
 [GEOM_FUNCTION]()   +                             # required
 [STAT_FUNCTION]()   +                             # not required
 [COORDINATE_FUNCTION]()  +                        # not required
 [SCALE_FUNCTION]()  +                             # not required
 [LABELS_FUNCTION]() +                             # not required
 [THEME_FUNCTION]()                                # not required
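
One possible way to fill in the template, using the mpg dataset that ships with ggplot2. The particular geoms, scales and labels are choices made for this sketch only.

```r
library(ggplot2)

p <- ggplot(data = mpg,                                # [DATA]
            mapping = aes(x = displ, y = hwy)) +       # [MAPPINGS]
  geom_point(alpha = 0.4) +                            # [GEOM_FUNCTION]
  stat_smooth(method = "lm", se = FALSE) +             # [STAT_FUNCTION]
  coord_cartesian(ylim = c(0, 50)) +                   # [COORDINATE_FUNCTION]
  scale_x_continuous(breaks = 2:7) +                   # [SCALE_FUNCTION]
  labs(x = "Engine displacement (litres)",
       y = "Highway miles per gallon") +               # [LABELS_FUNCTION]
  theme_bw()                                           # [THEME_FUNCTION]

p
```

Only the first two pieces are needed to get a plot on the screen; every later layer refines how the same data and mappings are displayed.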

Bar chart: Age/race and support for legalizing


  • What would we expect this plot to look like?

  • Can we break creating this plot into steps?

  • Helpful questions to ask yourself

    • Are we plotting raw data?
    • Do we need to mutate data?
    • Do we need to summarize data in some way?
    • What is the geom(etry) we need to use?

Bar chart: Age/race and support for legalizing


gss_sm |>
  select(race, agegrp, grass) |>
  drop_na() |>
  group_by(agegrp) |>
  summarize(grass_share = mean(grass == "Legal"))
# A tibble: 5 × 2
  agegrp    grass_share
  <fct>           <dbl>
1 Age 18-35       0.702
2 Age 35-45       0.664
3 Age 45-55       0.543
4 Age 55-65       0.657
5 Age 65+         0.447

Bar chart: ggplot setup


gss_sm |>
  select(race, agegrp, grass) |>
  drop_na() |>
  group_by(agegrp) |>
  summarize(grass_share = mean(grass == "Legal")) |>
  ggplot(mapping = aes(x = agegrp, 
                       y = grass_share))

Bar chart: Add bars/columns


gss_sm |>
  select(race, agegrp, grass) |>
  drop_na() |>
  group_by(agegrp) |>
  summarize(grass_share = mean(grass == "Legal")) |>
  ggplot(mapping = aes(x = agegrp, 
                       y = grass_share)) +
  geom_col()

Bar chart: Add fill colors


gss_sm |>
  select(race, agegrp, grass) |>
  drop_na() |>
  group_by(agegrp) |>
  summarize(grass_share = mean(grass == "Legal")) |>
  ggplot(mapping = aes(x = agegrp, 
                       y = grass_share, 
                       fill = agegrp)) +
  geom_col()

Bar chart: Set color palette


gss_sm |>
  select(race, agegrp, grass) |>
  drop_na() |>
  group_by(agegrp) |>
  summarize(grass_share = mean(grass == "Legal")) |>
  ggplot(mapping = aes(x = agegrp, 
                       y = grass_share, 
                       fill = agegrp)) +
  geom_col() +
  scale_fill_brewer(type = "qual")

Bar chart: Manage guides


gss_sm |>
  select(race, agegrp, grass) |>
  drop_na() |>
  group_by(agegrp) |>
  summarize(grass_share = mean(grass == "Legal")) |>
  ggplot(mapping = aes(x = agegrp, 
                       y = grass_share, 
                       fill = agegrp)) +
  geom_col() +
  scale_fill_brewer(type = "qual", guide = "none")

Bar chart: Add labels


gss_sm |>
  select(race, agegrp, grass) |>
  drop_na() |>
  group_by(agegrp) |>
  summarize(grass_share = mean(grass == "Legal")) |>
  ggplot(mapping = aes(x = agegrp, 
                       y = grass_share, 
                       fill = agegrp)) +
  geom_col() +
  scale_fill_brewer(type = "qual", guide = "none") + 
  labs(x = NULL, y = "Share")

Bar chart: Split plots


gss_sm |>
  select(race, agegrp, grass) |>
  drop_na() |>
  group_by(agegrp, race) |>
  summarize(grass_share = mean(grass == "Legal")) |>
  ggplot(mapping = aes(x = agegrp, 
                       y = grass_share, 
                       fill = agegrp)) +
  geom_col() +
  scale_fill_brewer(type = "qual", guide = "none") + 
  labs(x = NULL, y = "Share") +
  facet_wrap(~ race, nrow = 1)

Bar chart: Add theme


gss_sm |>
  select(race, agegrp, grass) |>
  drop_na() |>
  group_by(agegrp, race) |>
  summarize(grass_share = mean(grass == "Legal")) |>
  ggplot(mapping = aes(x = agegrp, 
                       y = grass_share, 
                       fill = agegrp)) +
  geom_col() +
  scale_fill_brewer(type = "qual", guide = "none") + 
  labs(x = NULL, y = "Share") +
  facet_wrap(~ race, nrow = 1) +
  theme_bw()

Bar chart: Add error bars


gss_sm |>
  select(race, agegrp, grass) |>
  drop_na() |>
  group_by(agegrp, race) |>
  summarize(grass_share = mean(grass == "Legal"),
            grass_se = sd(grass == "Legal") / sqrt(n())) |>
  ggplot(mapping = aes(x = agegrp, 
                       y = grass_share, 
                       fill = agegrp)) +
  geom_col() +
  geom_errorbar(aes(ymin = grass_share - 2 * grass_se, 
                    ymax = grass_share + 2 * grass_se), 
                width = .1) +
  scale_fill_brewer(type = "qual", guide = "none") + 
  labs(x = NULL, y = "Share") +
  facet_wrap(~ race, nrow = 1) +
  theme_bw()
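
The error bars span the estimate plus and minus two standard errors, which is a common shortcut for an approximate 95% confidence interval. A quick check of the arithmetic, using the share and standard error from the South / White row of the earlier Obama summary table:

```r
# Numbers taken from the South / White row of the Obama summary table
share <- 0.413
se    <- 0.0245

# Interval of the kind drawn by geom_errorbar(): share +/- 2 standard errors
interval <- c(lower = share - 2 * se, upper = share + 2 * se)
interval

# The exact normal multiplier for a 95% interval is close to 2
multiplier <- qnorm(0.975)
multiplier
```

With small groups the normal approximation is rough, so such bars are best read as a visual guide rather than an exact interval.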

Bar chart: Enhance theme


gss_sm |>
  select(race, agegrp, grass) |>
  drop_na() |>
  group_by(agegrp, race) |>
  summarize(grass_share = mean(grass == "Legal"),
            grass_se = sd(grass == "Legal") / sqrt(n())) |>
  ggplot(mapping = aes(x = agegrp, 
                       y = grass_share, 
                       fill = agegrp)) +
  geom_col() +
  geom_errorbar(aes(ymin = grass_share - 2 * grass_se, 
                    ymax = grass_share + 2 * grass_se), 
                width = .1) +
  scale_fill_brewer(palette = "Greens", 
                    direction = -1, 
                    guide = "none") + 
  labs(x = NULL, y = "Share") +
  facet_wrap(~ race, nrow = 1) +
  # theme_bw() +
  hrbrthemes::theme_ipsum(grid = "Y", 
                          axis_title_size = 14)

Bar chart: Fix x-axis


gss_sm |>
  select(race, agegrp, grass) |>
  drop_na() |>
  group_by(agegrp, race) |>
  summarize(grass_share = mean(grass == "Legal"),
            grass_se = sd(grass == "Legal") / sqrt(n())) |>
  ggplot(mapping = aes(x = agegrp, 
                       y = grass_share, 
                       fill = agegrp)) +
  geom_col() +
  geom_errorbar(aes(ymin = grass_share - 2 * grass_se, 
                    ymax = grass_share + 2 * grass_se), 
                width = .1) +
  scale_fill_brewer(palette = "Greens", 
                    direction = -1, 
                    guide = "none") + 
  labs(x = NULL, y = "Share") +
  facet_wrap(~ race, nrow = 1) +
  # theme_bw() +
  hrbrthemes::theme_ipsum(grid = "Y", 
                          axis_title_size = 14) +
  theme(
    axis.text.x = element_text(angle = 75, vjust = .2)
  )

Scatterplot: Gender gap across regions and age


  • What would we expect this plot to look like?

  • Can we break creating this plot into steps?

Scatterplot: Gender gap across regions and age


gss_sm |>
  select(bigregion, sex, income16, age) |>
  drop_na()
# A tibble: 2,589 × 4
   bigregion sex    income16           age
   <fct>     <fct>  <fct>            <dbl>
 1 Northeast Male   $170000 or over     47
 2 Northeast Male   $50000 to 59999     61
 3 Northeast Male   $75000 to $89999    72
 4 Northeast Female $170000 or over     43
 5 Northeast Female $170000 or over     55
 6 Northeast Female $60000 to 74999     53
 7 Northeast Male   $170000 or over     50
 8 Northeast Female $30000 to 34999     23
 9 Northeast Male   $60000 to 74999     45
10 Northeast Male   $60000 to 74999     71
# ℹ 2,579 more rows

Scatterplot: ggplot setup


gss_sm |>
  select(bigregion, sex, income16, age) |>
  drop_na() |>
  ggplot(mapping = aes(x = age, 
                       y = as.integer(income16)))
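
A caution about as.integer(income16): for a factor it returns the position of each level, not a dollar amount. A small base R sketch with a made-up three-level factor (the lowest label is hypothetical; the other two appear in the data preview):

```r
# Hypothetical income factor with ordered levels
income <- factor(
  c("$1000 to 2999", "$170000 or over", "$50000 to 59999"),
  levels = c("$1000 to 2999", "$50000 to 59999", "$170000 or over")
)

# as.integer() gives the level codes 1, 2, 3, ...
codes <- as.integer(income)
codes

# The codes index back into the level labels
levels(income)[codes]
```

This is why the y-axis is labelled "Income Category" in the later steps: the distances between codes are not distances in dollars.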

Scatterplot: Add points


gss_sm |>
  select(bigregion, sex, income16, age) |>
  drop_na() |>
  ggplot(mapping = aes(x = age,
                       y = as.integer(income16))) +
  geom_point(alpha = .2)

Scatterplot: Add trend


gss_sm |>
  select(bigregion, sex, income16, age) |>
  drop_na() |>
  ggplot(mapping = aes(x = age,
                       y = as.integer(income16))) +
  geom_point(alpha = .2) +
  stat_smooth(se = FALSE)

Scatterplot: Add labels


gss_sm |>
  select(bigregion, sex, income16, age) |>
  drop_na() |>
  ggplot(mapping = aes(x = age,
                       y = as.integer(income16))) +
  geom_point(alpha = .2) +
  stat_smooth(se = FALSE) +
  labs(x = "Age", y = "Income Category")

Scatterplot: Split plots


gss_sm |>
  select(bigregion, sex, income16, age) |>
  drop_na() |>
  ggplot(mapping = aes(x = age,
                       y = as.integer(income16))) +
  geom_point(alpha = .2) +
  stat_smooth(se = FALSE) +
  labs(x = "Age", y = "Income Category") +
  # creates two-way grid of plots
  facet_grid(sex ~ bigregion) 

Scatterplot: Add theme


gss_sm |>
  select(bigregion, sex, income16, age) |>
  drop_na() |>
  ggplot(mapping = aes(x = age,
                       y = as.integer(income16))) +
  geom_point(alpha = .2) +
  stat_smooth(se = TRUE) +
  labs(x = "Age", y = "Income Category") +
  facet_grid(bigregion ~ sex) +
  theme_bw()

Dot-whisker plot: Was Obama vote different across race and regions?


  • What would we expect this plot to look like?

  • Can we break creating this plot into steps?

Dot-whisker plot: Was Obama vote different across race and regions?


gss_sm |>
  select(bigregion, race, obama) |>
  # instead of drop_na() we can use filter()
  filter(!is.na(obama)) 
# A tibble: 1,730 × 3
   bigregion race  obama
   <fct>     <fct> <dbl>
 1 Northeast White     0
 2 Northeast White     1
 3 Northeast White     0
 4 Northeast White     0
 5 Northeast White     1
 6 Northeast White     1
 7 Northeast White     0
 8 Northeast Black     1
 9 Northeast Black     1
10 Northeast White     0
# ℹ 1,720 more rows
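
The difference between filter(!is.na(obama)) and a bare drop_na() matters: the first removes rows with a missing value in one column, the second removes rows with a missing value in any column. This is why the row counts differ between this slide (1,730) and the earlier one that also selected age (1,728). A sketch with hypothetical toy data, assuming dplyr and tidyr are installed:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with missing values in two columns
toy <- tibble(
  age   = c(47, NA, 72, 43),
  obama = c(0, 1, NA, 1)
)

only_obama <- toy |> filter(!is.na(obama))  # keeps the row with missing age
same_thing <- toy |> drop_na(obama)         # same rows as the filter() call
all_cols   <- toy |> drop_na()              # also drops the row with missing age

all.equal(only_obama, same_thing)
nrow(only_obama)
nrow(all_cols)
```

Naming the columns inside drop_na() keeps observations that are missing only on variables you do not plan to use.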

Dot-whisker plot: ggplot setup


gss_sm |>
  select(bigregion, race, obama) |>
  filter(!is.na(obama)) |>
  ggplot(mapping = aes(x = obama,
                       y = bigregion,
                       color = bigregion))

Dot-whisker plot: Add points


gss_sm |>
  select(bigregion, race, obama) |>
  filter(!is.na(obama)) |>
  ggplot(mapping = aes(x = obama,
                       y = bigregion,
                       color = bigregion)) +
  geom_point(alpha = .3)

Dot-whisker plot: Jitter points


gss_sm |>
  select(bigregion, race, obama) |>
  filter(!is.na(obama)) |>
  ggplot(mapping = aes(x = obama,
                       y = bigregion,
                       color = bigregion)) +
  geom_jitter(width = .05, alpha = .3)

Dot-whisker plot: Add data summary


gss_sm |>
  select(bigregion, race, obama) |>
  filter(!is.na(obama)) |>
  ggplot(mapping = aes(x = obama,
                       y = bigregion,
                       color = bigregion)) +
  geom_jitter(width = .05, alpha = .3) +
  stat_summary(fun.data = "mean_cl_normal",
               color = "black")
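
The "mean_cl_normal" summary relies on the Hmisc package being installed, since ggplot2's mean_cl_normal() wraps Hmisc::smean.cl.normal(). What it computes is the mean with a t-based confidence interval. The helper below, mean_ci, is a hypothetical hand-written equivalent for illustration, not part of ggplot2.

```r
# Hypothetical helper: mean and t-based confidence interval,
# returned in the y / ymin / ymax format used by stat_summary(fun.data = ...)
mean_ci <- function(x, conf = 0.95) {
  n      <- length(x)
  m      <- mean(x)
  se     <- sd(x) / sqrt(n)
  t_crit <- qt((1 + conf) / 2, df = n - 1)
  data.frame(y = m, ymin = m - t_crit * se, ymax = m + t_crit * se)
}

# Hypothetical votes for one region
votes <- c(0, 1, 1, 0, 1)
ci <- mean_ci(votes)
ci
```

A function like this should also work when passed to stat_summary() as fun.data = mean_ci, which avoids the Hmisc dependency.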

Dot-whisker plot: Manage guides


gss_sm |>
  select(bigregion, race, obama) |>
  filter(!is.na(obama)) |>
  ggplot(mapping = aes(x = obama,
                       y = bigregion,
                       color = bigregion)) +
  geom_jitter(width = .05, alpha = .3) +
  stat_summary(fun.data = "mean_cl_normal",
               color = "black") +
  scale_color_brewer(palette = "Dark2",
                     guide = "none")

Dot-whisker plot: Add labels


gss_sm |>
  select(bigregion, race, obama) |>
  filter(!is.na(obama)) |>
  ggplot(mapping = aes(x = obama,
                       y = bigregion,
                       color = bigregion)) +
  geom_jitter(width = .05, alpha = .3) +
  stat_summary(fun.data = "mean_cl_normal",
               color = "black") +
  scale_color_brewer(palette = "Dark2",
                     guide = "none") + 
  labs(x = "Voted for Obama", y = NULL,
       title = "Obama vote shares by region")

Dot-whisker plot: Split plot


gss_sm |>
  select(bigregion, race, obama) |>
  filter(!is.na(obama)) |>
  ggplot(mapping = aes(x = obama,
                       y = bigregion,
                       color = bigregion)) +
  geom_jitter(width = .05, alpha = .3) +
  stat_summary(fun.data = "mean_cl_normal",
               color = "black") +
  scale_color_brewer(palette = "Dark2",
                     guide = "none") + 
  labs(x = "Voted for Obama", y = NULL,
       title = "Obama vote shares by region and race") +
  facet_wrap(~ race, ncol = 1)

Dot-whisker plot: Add theme


gss_sm |>
  select(bigregion, race, obama) |>
  filter(!is.na(obama)) |>
  ggplot(mapping = aes(x = obama,
                       y = bigregion,
                       color = bigregion)) +
  geom_jitter(width = .05, alpha = .3) +
  stat_summary(fun.data = "mean_cl_normal",
               color = "black") +
  scale_color_brewer(palette = "Dark2",
                     guide = "none") + 
  labs(x = "Voted for Obama", y = NULL,
       title = "Obama vote shares by region and race") +
  facet_wrap(~ race, ncol = 1) +
  theme_bw()

Resources



Presentations after Thanksgiving

Reminders


  • Final presentations are during class after Thanksgiving

    • Each of you will have 5-7 minutes for presentation with 3-5 minutes of feedback
    • Sign up link will be sent tonight
    • There will be pizza and drinks!
  • Final PAPs are due on OSF (and link to it on Brightspace) by December 10
  • Q&A about PAPs, presentations and class in general on Thursday

Final presentations


  • 5-7 minutes \(\approx\) 5-7 slides
  • Do not put too much text on the slides:

    • \(6 \times 6\) rule: Unless absolutely unavoidable, don’t put more than 6 words per line and 6 lines per slide
    • Try not to read from slides
  • What to include: Motivation, Research Question, Hypotheses, Research Design (Context/Unit of analysis/Experiment or Observational), Independent/Dependent variables, Measurement and procedures, Possible issues

References

Cleveland, William S., and Robert McGill. 1984. “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods.” Journal of the American Statistical Association 79 (387): 531–54. https://doi.org/10.1080/01621459.1984.10478080.
Davis, Russell, Xiaoying Pu, Yiren Ding, Brian D. Hall, Karen Bonilla, Mi Feng, Matthew Kay, and Lane Harrison. 2022. “The Risks of Ranking: Revisiting Graphical Perception to Model Individual Differences in Visualization Performance.” IEEE Transactions on Visualization and Computer Graphics, 1–16. https://doi.org/10.1109/TVCG.2022.3226463.
Heer, Jeffrey, and Michael Bostock. 2010. “Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 203–12.
King, Gary. 2006. “Publication, Publication.” PS: Political Science & Politics 39 (1): 119–25.
Munzner, Tamara. 2014. Visualization Analysis and Design. CRC Press.